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1 . Introduction 

Biological systems are inherently dynamic and gene expression levels may be 
temporally regulated for a wide range of reasons including the cell cycle, cir- 
cadian rythms, developmental processes or in response to stimuli (e.g. drug 
treatment or environmental stress) (Spellman et al., 1998; Wang and Kim, 2003; 
Calvano ct al., 200-5). Microarrays arc a high throughput assaying technique for 
measuring these expression levels of thousands of genes simultaneously. Each mi- 
croarray hybridisation provides a snapshot of expression levels at a single point 
in time; by carrying out scqiiential hybridisations on biological samples arising 
from the same source (e.g. a human patient), the evolution of these expression 
levels over time can then be elucidated. 

The resulting microarray time series give rise to data that possess certain 
characteristics which make their analysis particularly challenging. Specifically, 
due to the large number of genes under study simultaneously, the data is very 
highly dimensional and there are many more genes than there arc time points. 
Each time series will be replicated typically no more than ten times, and exper- 
iments with no replication are not uncommon. The number of genes will often 
number in the tens of thousands while there are rarely more than ten time points. 
Even with the falling cost of microarray technology, the limiting factor is often 
the ability to obtain biological samples which may be restricted due to ethical 
concerns or other practical, experimental issues. Other challenges include the 
fact that the data is noisy, with frequent missing observations, and individual 
heterogeneity. 

Our focus is on longitudinal study designs. In this type of microarray experi- 
ment, multiple biological units - for example human patients, individual mice or 
cell lines - are each repeatedly sampled over time to give a collection of observed 
time series for each gene under study. This type of biological replication is essen- 
tial for making inference about population parameters but is often overlooked 



in niicroarray studies duo to experimental issues. A longitudinal mieroarray ex- 
periment is described in Section 2 and provides the data for our case study. The 
purpose of the study was to follow twelve female and ten male adult human 
subjects over a period of 6 months, in order to characterise the change in gene 
expression levels over time in healthy humans. Figure 1 shows some of the raw 
data for a probe corresponding to the TMEFFl gene from this example data set 
where some of the characteristics discussed above can be seen to manifest them- 
selves. A key aspect of human data sets is that the gene expression levels are 
often collected with covariates - for example, the individual's age, sex and other 
phenotypic data such as height or weight may be recorded. In the case study, 
the individuals were stratified by age and gender which allows us to explore not 
only the evolution of gene expression levels over time but also which genes are 
differentially expressed between the two gender or age groups. 

When modelling experimental data arising from longitudinal mieroarray ex- 
periments there are three distinct challenges: (a) modelling each individual time 
series, across all genes and individuals, (b) accounting for the correlation between 
individuals on a gene by gene basis and (c) modelling the correlation between 
genes. Accounting for each of these sources of correlation - the temporal, the 
within-genc (between-individual) and the bctwecn-gene - is vital for obtaining 
better parameter estimates and avoiding a loss of power when testing for genes 
which are differentially or temporally expressed. With less than 10 timepoints, 
achieving (a) is not possible with standard time series analysis techniques - it 
is unlikely, for instance, that we would observe any periodicity. Instead, a field 
which has proven to be quite successful in this area is that of functional data anal- 
ysis (FDA). In the FDA paradigm, it is assumed that our observations are noisy 
realisations of an underlying smooth function of time which is to be estimated. 
These estimated functions, or curves, are then treated as the fundamental unit 
of data in any subsequent analysis. Formally, the signal-in-noise model assumed 
is that observation j/j taken at time ti is given by 

Vi = f{ti) + Ci (1) 

where /(•) is the function of interest to be estimated and Cj is an error term. 
Typically the infinite dimensional function /(•) is projected onto some finite 
dimensional basis using parameterisations such as splines, wavelets or fourier 
bases. In our discussion we will focus on splines in particular as these regularly 
occur in the literature in terms of both mieroarray and functional data analysis. 
For a thorough treatment of FDA, the monograph Ramsay and Silverman (2005) 
provides an excellent introduction. 

In a longitudinal study, for a particular gene, observations will be collected on 
not just a single function /(•), but a collection of n functions /i(-), i = 1, - ■ ■ ,n, 
one for each individual biological unit. Often the main quantity of interest is 
the population mean function /i(-) characterising the overall population gene 
expression level over time. In this ease we extend the signal-in-noise model (1) 
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so that the jth observation on individual at time Uj is given by 

Vij = l^iUj) + MUj) + e-ij (2) 

This is known as the functional mixed-effects model and is an extension of 
the standard linear mixed-effects model (Harville, 1977) where the fixed- and 
random-effects are both considered functions. Function /i(-) is treated as a fixed- 
effect as it is assumed to be some fixed, but unknown, population function to be 
estimated. In constrast, the functions fi{-),i = 1, • • • , n represent a random sam- 
ple from the population as a whole and are assumed to be i.i.d realisations of an 
underlying stochastic process. Model (2) has appeared in a number of different 
forms depending upon the exact parameterisation of the fixed- and random- 
effects. For instance, Guo (2002) models both as cubic smoothing splines while 
Rice and Wu (2001) prefer a B-spline representation. 

The task of handling correlations amongst the genes has, to date, generally 
been overlooked by researchers. It is a challenging, open problem to model 
both the between- and within-gene correlation simultaneously given the size 
of the data. Although it is well known that genes are co-regulated, for the 
sake of tractability the most common approach is to simply model each gene 
independently. In other words, given the framework outlined thus far, each gene 
would be modelled as a separate functional mixed-effects model. 

In this paper we propose a functional-mixed effects model and a framework 
for estimation and testing in one-sample problems. The model enables the esti- 
mation of a mean response curve with the inclusion of covariates, such as gender 
and sex, also modeled as time- varying smooth functions. We also show how a 
functional PCA can be applied to the estimated mean curves in order to identify 
the principal modes of fmictional variation in the data set, and visually represent 
the entire set of genes in a low-dimensional plot. 

The structure of the paper is as follow. Section 2 provides a description of a 
data set, previously collected and analysed by Karlovich et al. (2009), that we use 
here as a case study. The proposed model, inferential procedures and fimctional 
PCA are provided in Section 3. In Section 4 we present the experimenal results 
obtained in the context of our case study. In Section 5 we discuss how our 
methodology compares to related models that have appeared in the literature 
and compare our experimental results to that of the original study, as well as 
highlight some of the biological implications. Finally, we conclude in Section 6. 

2. Data description 

The data set used in our case study is taken from Karlovich et al. (2009). The 
purpose of the study was to characterise the gene expression levels of healthy 
human individuals over a period of 6 months. 22 subjects were studied, with 
gene expression levels assayed from blood samples at days 1, 14, 28, 90 and 180. 
One subject developed lung cancer during the course of the study and died prior 
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Fig. 1. Raw data for TMEFF1, females. Several key characteristics of the data can 
be observed: (1) irregularly spaced time points (2) missing data - individual 10 is only 
observed for the first three time points (3) significant individual heterogeneity (4) noisy 
observations 
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to Day 180, thus contributing only a partial time scries. All other individuals 
completed the study and were observed at all 5 time points. Twelve of the 
individuals were female and ten were male. In the original study, the subjects 
were divided into two age groups, with the younger group taken to be those 
subjects less than or equal to 55 years of age, and the older group those subjects 
over 55. 

In the original paper, the observation for a given gene on individual i at time 
t was modelled as 

= M + + figgendevi + l^aag^i + Pttimet + eu 

This is a standard linear mixed-effects model, /i is the average gene expression 
level across all individuals after controlling for gender, age and time effects, ai 
is an individual specific term allowing for a deviation in terms of the intercept 
of the model. The (3g, Pa and j3t parameters separate out the gender, age and 
time effects respectively while en is an error term. 

The model and study design permitted a wide range of biological issue to be 
explored. Using t-tests, the significance of the age and gender effects was de- 
termined. After correcting for multiple-testing by controlling the false discovery 
rate (FDR) using the procedure of Benjamini and Hochberg (1995), no genes 
showed a significant age effect. This was somewhat unexpected given previous 
studies (Eady et al., 2005; Whitney et al., 2003; Tang et al, 2004) but it was 
noted that these age effects might be harder to detect in blood than in other 
tissuses. 78 unique gender genes were identified including XIST, responsible for 
deactivating one of the X chromosomes in females in order to ensure dosage 
equivalence, and 23 genes mapped to the Y chromosome. Temporally regulated 
genes were identified by performing pairwise comparisons between Day 14 and 
Day 1, Day 28 and Day 14, and Day 180 to Day 90. This was partly due to con- 
cerns about a potential batch effect, as Days 1, 14 and 28 were processed in one 
batch, with Days 90 and 180 being processed in a second batch. No temporally 
regulated genes were identified in the Day 14 vs Day 1 or the Day 28 vs Day 
14 comparisons, but 248 probes were found to be differentially expressed when 
comparing Day 180 to Day 90, corresponding to 157 unique genes. 

Our proposed approach is to replace the original linear mixed-effects model 
with a functional one. The age and gender effects will be modelled as func- 
tions of time, along with the mean and individual curves. To avoid over- 
paramcterisation, all curves will be represented using smoothing splines. The 
result is a flexible model which permits the interaction of age and gender with 
time, if the data supports it. During our preprocessing we found little evidence 
of a batch effect and we will use the entire time course to identify temporally 
regulated genes, on the basis of the fitted mean function. 
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3. Methods 



We propose the following functional mixed-effects model for the data described in 
Section 2. Each gene is modelled independently. For a given gene, the observed 
gene expression level for individual i at time tij is given by 



where /i(-) models the mean expression levels across all individuals after account- 
ing for age and gender effects; afc(-) is the gender effect for gender k to which 
individual i belongs with k = {Male, Female}; is the age group effect for 
group I to which individual i belongs where I = {Young, Old}; 7i(-) is the indi- 
vidual specific effect for individual i and Cij is an error term. The functions /u(-), 
Q^/s(")) and 7j(-) are assumed to be smooth functions of time which we wish 
to estimate based on the noisy observations. We treat /i(-), ak{') and /3;(-) as 
fixed-effects, unknown population functions to be estimated, and the 7j(-) func- 
tions which are treated as random-effects as they represent a random sample of 
functions from the population as a whole. Formally, the 7i(-) are assumed to be 
i.i.d. realisations of an underlying Gaussian Process with mean and covariance 
function 5{r, s). 

The functions can be parameterized in a number of ways but we favour 
smoothing splines as these offer a fine degree of control over the amount to 
which the data is smoothed. Writing the vector of all observed time points for 
individual i as ti = [tii,ti2, ■ ■ ■ ,tin.]'^ where is the total number of observa- 
tions on individual i, (3) can be written in matrix form as 



where = ?/<(<j2), • • • ViiUn,)]'^ and = [e^, £42, • • • , CmJ^ are vectors 

of length Hi and /Lt = [/x(ti), /i(r2), • • • ,yu(TM)]^ is a vector of length M. The 
vectors a^, 0i and ji are defined similarly to fi. The values n , r2, • • • , tm denote 
the distinct design time points, of which there are M in total, and ti may differ 
from these may differ if individual i has missing data or duplicate observations 
for some time points. The matrix Xi is an incidence matrix of dimension riiX M 
where each row Xij contains all zeroes aside from the column m where tij = t„i- 
Further details on forming the incidence matrices and an example can be found 
in Appendix A.l. Recall that ~ GP{0,6),i = l,---,n, then the vectors 
7i are multivariatc-nornially distributed with mean and covariance matrix D 
where D{r,s) = 6{Tr,Ts). Similarly the noise term ej is multivariate-normally 
distributed with mean and covariance matrix and we assume that the 
vectors 7^ and arc independent. For simplicty we assume that Ri = a'^Imxrii, 
although a more complicated structure could be modelled at the expense of 
fitting more parameters. It is further necessary to impose the identifiability 
constraint that the age and gender fixed-effects for the two groups sum to zero, 
i.e. ctmaie + OL female = and (iyoung + f^oid = 0. For simplicity, therefore, we 



(3) 



yi = XiH + XiOtk + XiPi + Xiji + ej 



(4) 
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model a single gender and age effeet, cx = a female and f3 ~ (3oid respectively. 
These constraints can equivalently be expressed be rewriting (4) as 

yi = XiH + Wicx + Zif3i + Xifi + ej (5) 

where 

^ _ f —Xi if i is male Z — i ^ young 

* \ Xi if i is female * \ -X"i if i is old 

Let r] = [/It, a,/9]^, then (5) can be rewritten more compactly as 

yi = X*r] + Xai + €i 

where 

X* = [ X, ] 

Finally, the complete data vector for all individuals, y, can be expressed as 

y = X*r] + Xj + e (6) 

where y = [yi ,y2 ^' " ^UnV is an = ^^^ni length vector, and 7 and e 
are similarly defined, X* = [Xl'^' .XX,'^ , ■ ■ ■ X*^]^ is an A/' x 3M matrix and 
X = diag{Xi, X2, • ■ • , Xn) is an TV x nM matrix, with the diag{-) operator 
denoting a block diagonal matrix. The vectors 7 and e axe both multivariate- 
normally distributed with mean and covariance matrix D = diag{D, ■ ■ ■ , D) 
and R = diag{Ri,R2, ■ ■ ■ , Rn) respectively. 



3. 1. Parameter Estimation 

Model (6) is in the form of the standard linear mixed-effects model (Laird and 
Ware, 1982). Standard practice for obtaining estimates of the fixed- and random- 
effects, 17 and 7i, i = 1, ■ ■ ■ ,n would be to maximise the joint likelihood of rj and 
7j (Robinson, 1991). This is equivalent to minimising the following generalized 
log likelihood (GLL) criterion 

GLL ={y- X*ri - X-ffR-\y - X*ri - X7) + log \D\ (7) 
+j^D-'^j + log\R\ 

However, in our model the fixed- and random-effects are the fitted values of the 
smoothing spline estimates of the functions /i(-), ct{-), /?(•), 7i(-), i = 1, • • • , n, and 
it is necessary to incorporate a penalty term for the roughness of the smoothing 
splines into the likelihood. The penalized GLL is then given by 

POLL = GLL + E j/''^^*"^*^^''^^} + ^j\"ii)Ydt (8) 

+A f\a"{t)]^dt + X j\p"{t)Ydt 

J a J a 
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where the integrals quantify the roughness of the curves /i(-), q(-)j /?(■)• "/'(Oi * 
1 . ■ ■ ■ . in terms of their squared second derivative, although other penalties 
could be used. The scalars A and A-y are positive-valued smoothing parameters 
that control the roughness of the fit. For a given smoothing spline fit, A = 
would correspond to an interpolation of the data points while as A tends to 
infinity, the fit tends to a straight line. Note that the same smoothing param- 
eter A is used for the three fixed-effects functions, /i(-), c({-), simi- 
larly the same smoothing parameter, A.y, is used for all random-eflFect functions 
7j(-), z = 1, • • • , n. This is conceptually justified as each function ji is assumed 
to be a realisation of the same underlying Gaussian Process, but it is possible to 
envisage selecting a separate smoothing parameter for each fixed- and random- 
eff'ect function, albeit at the expense of a far greater computational cost. 

Minimization of (8) requires calculation of the integral of the squared sec- 
ond derivative of the fixed- and random-eflfects. In the case of cubic smoothing 
splines, for a given function f{t) observed at time points ti,t2,- " :tn such that 
/ = [f{ti), .f{h), • • • 1 .f{tn)]'^, there is a roughness m,atrix G which can be cal- 
culated in a computationally efficient manner that satisfies: 

(\f"{t)fdt = fGf 

J a 

this result can be found in Green and Silverman (1994) and we have reproduced 
the (kirivation in Appendix A. 2 for completeness. Incorporating the roughness 
matrix into (8) gives 

n 

PGLL = GLL + X^^-fJ'Gji + X{n'^GiJ, + a^Gcx + l3'^Gl3) 

i=l 

= GLL + A^7^G7 + Xr^G^r) 

where G is a block diagonal matrix comprised of the matrix G repeated n times. 
Similarly, G* is a block diagonal matrix comprised of G repeated three times. 

After a rearrangement on the terms featuring in the penalised log-likelihood, 
the model can be re-written in terms of the regularised covariance matrices 
= {D~^ + X-yG)~^ and V = XD^X^ + R, so called because the matrix 
is obtained by regularising the covariance matrix D with the term X-^G. 
This method of imposing the smoothness constraints by regularisation of the 
covariance matrix can be credited to Wu and Zhang (2006). 

Minimising (8) gives the BLUE and BLUP of the fixed- and random-effects 

as 

Ti={X*^V-'^X* + XG*)-^X*^V-'^y (9) 
J = D^X^V-\y - X*r,) (10) 

The discussion thus far has assumed that the variance components D and cr^ were 
known. Of course, in practical applications this will not be the case. Assuming 
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the random-cfFects 7^ and error terms e are known, the maximum hkehhood 
estimators D and ct^ are given as 

^=^E7.7f ^^ = l^e-e (11) 

As the random-effects 7^ and error terms are not, in fact, directly observed, we 
resort to the Expectation-Maximisation algorithm where they can be treated as 
missing data. In this procedure the sufficient statistics of D and ct^ - 'yi'yf,i = 
l,---,n and e^e respectively - are replaced by their conditional expectations 
which are calculated at the E-step. In the M-step, the maximum likelihood 
estimators are then calculated having replaced the sufficient statistics by these 
conditional expectations, which are given by 

Ebal\y,ri = ri] = ^aj + D,-D,X'[vr^X,b^ (12) 
E[e^e\y,ri = fi\ = e + a'^N - aHr{V-^) (13) 

where tr{-) denotes the trace of a matrix and Vi = XiD^Xf +a^Inixni ■ Deriva- 
tions of these conditional expectations are given in Appendix A. 3. 

3.2. Model Selection 

Thus far we have treated the smoothing parameters A and A-^ as fixed. In 
reality, optimal values of these parameters must be found using a model selection 
procedure. Guo (2002) made use of the relationship between a smoothing spline 
and a linear mixed-effects model in order to treat the smoothing parameters 
as variances components that could be estimated during the normal course of 
the EM-algorithm. We prefer, however, to dissociate the model selection from 
parameter estimation and numerically optimise over the two dimensional space 
of non-negative reals (A x A^) as this is a much more flexible approach. There 
are a number of different criteria for scoring the smoothing parameters, all of 
which essentially trade off between model fit and model complexity. 

Ma et al. (2006) 's smoothing-spline clustering approach for microarray data, 
for instance, employed Wahba (1977) 's generalized cross validation (GCV) cri- 
terion. It is well known, however, that GCV tends to undersmooth (Lee, 2003). 
Alternatively, we can employ either the Akaike Information Criterion (AIC) or 
the Bayesian Information Criterion (BIC): 

AIC{X,X^) = -21ik-F2df 
BIC{X, Xj) = -21ik + log{N)d{ 

These two criteria both score the smoothing parameters in terms of the likelihood 

- measuring the model fit - adjusted for a penalty term for the model complexity, 
in terms of degrees of freedom. The difference lies in the size of the penalty term. 
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with BIC giving more conservative results when log{N) > 2, in other words when 
there are more than 9 data points. 

Both of these criteria, and GCV, have a sound theoretical basis. We suggest, 
therefore, to choose which one to use on the basis of a priori knowledge about the 
kind of patterns we expect to observe in a given data set. If, as in our example 
data set, we do not expect there to be many genes with curvy temporal profiles, 
then we may prefer the more conservative BIC. On the other hand, in a data 
set with a greater number of time points and with more expected variability - 
in response to infection for instance - then we may prefer the AIC in order to 
better capture the more complex patterns expected. 



3.2.1. Smoother Matrices 

In order to evaluate the criteria, it is necessary to calculate the degrees of freedom 
of the model. As per Buja et al. (1989), the degrees of freedom associated with 
the fixed- and random-effects, ij and 7, can be expressed as the trace of some 
smoother matrix A such that y = Ay. Equivalently, it is useful to determine 
the two smoother matrices A = A^ + A^ so that the degrees of freedom of the 
fixed- and random-effects can be accounted for separately. 

Recall that the fitted values of the fixed-effects at the design time points can 
be written as X*ri. Replacing 17 with (9) gives 

X*J7 = X*{X*'^V-'^X* + AG*)-iX*^V-iy = A^y 

and so the smoother matrix is given by 

A^ = x*{x*^v-^x* + AG*)-ix*'^y-i 

Similarly, the fitted values of the random-effects at the design time points can 
be written as X7, which gives 

Xj = XD^X'^V-HIn - Ar,)y = A^y 

The degrees of freedom of the model can then be calculated as df = tr{Ari + 
Ary) + 1, which is the trace of the smoother matrix plus an additional paramter 
for fitting the noise variance a^. 

With the scoring function in place any kind of two-dimensional optimisation 
routine can be used, although in practice a simple grid search or sequential line 
optimisation is recommended (Wu and Zhang, 2006) . Wc have found that a more 
sophisticated simplex-search optimiser (Nelder and Mead, 1965) can be employed 
without incurring a significant computational cost. This allows optimisation 
over the two smoothing parameters A and A-y simultaneously without needing to 
calculate the derivative of the criterion. 
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3.3. Confidence Bands 

Pointwise confidence bands at the design time points for each of the fixed-effects 
functions can be determined either theoretically or using a bootstrap resampling 
procedure. In the case of the former, we have 

covirj) = {X*'^V-'^X* + AG*)-iX*^V-iX*(X*^V-iX* + XG*)-'^ 

The diagonal elements of cov{fi), therefore, give the variance of the fixed-effects 
at the design time points with the first M elements corresponding to /x(-), the 
next M elements to a(-), and the final M elements to /3(-). In fact, due to the 
block diagonal structure of covifi), these M elements will be the same across 
all three fixed-effects. Confidence bands for a significance level a at the design 
time points Ti can then be calculated for (i as fi{Ti) ± ZyJcov{jl{Ti)), where z 
is the critical value under the normality assumption such that 4){z) = 1—^0;. 
These bands can be calculated for the other fixed-effects d and (3 in an identical 
fashion. 

Alternatively, confidence intervals can be estimated by resampling the between- 
and within-individual residuals. To construct a bootstrapped sample for a single 
individual, first one of the individual functions 7^ is randomly selected and eval- 
uated at the design time points - denote this vector as 7*. Next, M residuals 
from the noise vector e, are resampled with rcplaccirncnt, writing this vector as 
e*. Then, the bootstrapped observation vector y* is given by 

y* = /X + a* + /3* + 7* -I- e* 

where a* = a if the individual is female and —a otherwise, similarly for (3. This 
process is then repeated for n individuals, sampling the individual functions with 
replacement, to give a complete bootstrapped data set. The model is then fit to 
this resampled data and new estimates for the fixed-effects obtained. Repeating 
this process for a large number of iterations gives a large number of fixed-effects 
estimates from which the confidence bands at a given significance level can be 
determined empirically. 

3.4. Hypothesis Testing 

Fitting model (6) allows us to separate out the mean, age and gender effects 
for each gene. It is then possible to determine whether there is a significant 
difference between age groups or genders for a given gene by testing the null 
hypothesis that the size of the effect is zero. As the effects are modelled as 
functions, a natural way to quantify their size is the L2 norm and testing the 
significance of the age effect for a given gene can be framed as 

ifo:||a(-)ll2 = H,:\\a{-)\\2>0 

Assessing the significance is complicated by the fact that the sample sizes are 

small and the null distribution is unknown. Instead, it can be determined empir- 
ically by using a data resampling scheme such as the bootstrap or permutation 
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procedure. For example, in Storey ct al. (2005), the mill distribution of their 
F-type test-statistic was determined using a nonparametric bootstrap procedure 
by resampling the individual effects and error terms with replacement. Here, 
however, the null distribution of the L2 norm of the age and gender effects can 
be estimated empirically using a permutation procedure where the class assign- 
ments - male/female or young/old - are randomly permuted. 

These same ideas can be applied when testing for genes which are temporally 
regulated. In this case, the null hypothesis that there is no change over time 
can be formulated as ||/u'(-)||2 = where /i'(-) is the first derivative of the mean 
curve. The null distribution can then be empirically estimated by randomly 
permuting the time points. 

3.5. Functional Principal Components Analysis 

Fitting model (6) to each gene yields a set of mean curves iJ,i{t),i = 1,---,G 
where G is the total number of genes in the data set. Performing a fPCA on this 
set of curves allows us to identify the main patterns of variation across all genes. 
A straight-forward way of doing so is the discretisation method described in Ram- 
say and Silverman (2005) which is essentially a two-stage approach to fPCA: (1) 
the data are smoothed by fitting model (6) to each gene (2) a fPCA is performed 
on the smoothed data - in the form of the set of curves iii{t),i = 1, • • • , G. Alter- 
native methods of fPCA such as James et al. (2000), which estimate and smooth 
the PCs directly, cannot be applied in this case where there are two levels of 
variation - the between and within-gene. 

First, each curve is discretised on a fine grid of n equally spaced points across 
the range of the time course. If there are N curves in total, this yields a data 
matrix X, of dimension N x n. A standard PCA can then be performed on X. 

This procedure gives n principal components, each a vector of length n. It is 
then necessary to transform these vectors back into functions. A standard PCA 
can be defined as solving the eigenequation 



where V = N~^X'^X is the sample covariance matrix of X, A is one of the 

eigenvalues of V, and u is one of the eigenvectors, or principal components. In 
the functional setting, we replace V by a covariance function v{s,t), and tt by a 
function of s, ^(s) such that the eigenequation (14) becomes 



for a given value of s. Noting that after discretisation of the curves the elements 

of the matrix V = v{sj,Sk) where j and k are any of the n discretised points 
on the fine grid, the integral in (15) can be approximated as a summation such 



Vu = Xu 



(14) 




(15) 
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that 



/ v{s, t)i{t)dt = w ^ v{s, Sk)ik 
fc=i 

where w is the spacing between the points on the fine grid, and are the 
discretised values of the function S,{s). The approximate discrete form of the 
functional eigenequation is therefore 

wVi = pi 

which corresponds to (14) with p = wX. Assuming the eigenvectors obtained 
from the standard PCA have been normalised, the equivalent functional con- 
straint that / ^(s)^(is = 1 is achieved by enforicing wH^lP = 1. The function 
^(•) is then recovered by interpolating the points ^. Assuming the grid is fine 
enough, the choice of interpolation method is irrelevant. 

As with a standard PCA, we will wish to retain only a small number of func- 
tional PCs. As is standard practice, the eigenvalues p can be used to facilitate 
this choice, by retaining enough PCs to explain most of the variation in the data. 
Assuming K PCs are retained, for curve i we have 

K 

yi{t) = n{t) +Y^Kiuik{t) + ei{t) 

k 

where Ktk are the PC loadings for curve i. These can be estimated by minimising 
the residuals yi{t) — Kikik{t), which in practice again requires discretisation 
of the curve i, and the PCs ^k{t)- 

4. Results 

We fit the functional mixed-effects model described in Section 3 to the example 
data set described in Section 2, independently for each probe. Convergence 
of the EM algorithm was confirmed by convergence of the variance components 
estimates ct^ and D and typically took around 30 iterations. 100 iterations of the 
simplex optimisation procedure were used to select the smoothing parameters. 
After obtaining estimates of the mean, age and gender effects, and individual 
curves, these were assessed for significance. To relieve some of the computational 
burden, permuted null test statistics were shared across all genes - theoretical 
results justifying this pooling can be found in Storey et al. (2004). Each gene 
was permuted 32 times, yielding in excess of 1 million null test statistics for each 
comparison. Prom these null distributions, empirical p-values were calculated, 
which were then corrected for multiple testing using the procedure of Benjamini 
and Hochberg (1995) to control the FDR at 10%. 
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After applying multiple testing corrections, no significant age genes were 
identified, as in the original analysis. 21 probes were found to be gender specific. 
Two of these 21 probes can be found on the Y-chromosome but are not mapped to 
any known genes. The remaining probes correspond to 7 known genes and 2 open 
reading frames, given in Table 2. Aside from XIST which, as discussed in Section 
2 is only expressed in females and is responsible for X-chromose inactivation to 
facilitate dosage equivalence between the sexc^s, all significant genes and the two 
open reading frames are found on the Y-chromosome. 

The highest ranked gender-effect gene on an autosomal chromosome was 
found to be TUBB2A, located on chromosome 6 and ranked number 23, with 
an associated FDR of 13%, hence of borderline significance. The gene and fitted 
mean and gender-effect curves is plotted in Figure 3, where a definite difference 
between the two groups is apparent, corresponding to between a 3- and 4-fold 
difference in expression levels. 

299 probes were found to be significantly temporally regulated, corresponding 
to 183 unique, mapped genes. The highest ranking gene was found to be MBP 
- myelin basic protein - given as one of the examples in Figure 5. Myelin is 
an insulating sheath covering nerve cells, essential for the correct functioning 
of the central nervous system and degredation of myelin can be found in many 
neurodegenerative diseases such as multiple sclerosis. It is thought that MBP 
might function to maintain the correct structure of myelin, which may explain 
why we found it to be seasonally regulated, although we could find no existing 
evidence of this. 

We performed a functional PCA of the gene mean curves. Each curve was 

discretised into 1,000 equally spaced points, then normalised by subtracting the 
first observation from the rest of the points. Thus, each curve represents the 
change in expression levels over time, relative tot = 0. The first two PC functions 
are given in Figure 6. The first PC accounts for 99.4% of the variation and 
corresponds to a linear change in expression levels over time. The second PC 
accounts for 0.5% of the variation and describes expression levels which rise over 
the first threee months before falling for the next three months, or vice versa. As 
these two PCs represent almost all of the variation in the curves, we estimated 
the loadings for each gene and plotted the results in Figure 4. Four outliers have 
been highlighted and each of these is plotted in Figure 5. It can be seen that 
the outliers in the loadings plot correspond to those genes which change most 
over time, with the distinctive line of points in the center corresponding to genes 
which change linearly. For these genes with linear dynamics, the size of the first 
PC loading is relative to the slope. Genes which can be separated on the y-axis 
are those with a quadratic temporal profile. 

5. Discussion 

A number of different models have been proposed in the literature for the analysis 
of microarray time series data. One of the earliest examples of a FDA approach 
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(a) (b) 



CO 




5 6 7 8 9 10 50 100 150 

Fitted values Day 




Fig. 2. Residual analysis for the TUBB2A model fit. (a) Standardised residuals against 
fitted values (b) Standardised residuals against time (c) Standardised residuals against 
observations (d) QQ-plot of standardised residuals. These plots can be used to detect 
patterns in the data which the model has failed to capture. Aside from the obvious group- 
ings as a result of the difference in gene expression levels between males and females, 
there appears to be little structure to the residuals. In all cases, the triangles correspond 
to observations on subject 174, who developed lung cancer during the course of the 
study and died prior to the final time point. It can be seen that this subject contributes 
two obvious outlying residuals, which may have negatively impacted the goodness of fit 
criteria calculated by Karlovich et al. (2009), possibly resulting in its removal from any 
subsequent analysis 
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TUBB2A 




Fig. 3. Plot of TUBB2A's fitted longitudinal profiles. We have identified TUBB2A as 
a gene with a potentially novel gender effect. Observations on females are shown as 
squares, and those on males are shown as circles. The solid line is the overall mean 
expression level over time, after removing age and gender effects. The dotted line is the 
mean plus gender effect for females, and the dashed line is the mean plus gender effect 
for males 
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Fig. 4. Functional principal components analysis loadings plot. Two functional principal 
components capture 99.9% of the observed variation in the fitted mean curves for each 
gene. The loadings on the first principal component function corresponds to the x-axis, 
which represents linear variation over time. The second principal component function 
captures variation which is of a more quadratic nature. These two principal component 
functions are given in Figure 6. Four outliers representing the spectrum of observed 
temporal profiles have been highlighted; individual plots for these genes are given in 
Figure 5 
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Principal Component Function 1 




Principal Component Function 2 




1 14 28 90 180 

Day 



Fig. 6. Two principal component functions which explain 99.9% of the variation observed 
In the fitted mean curves for each gene. The first principal component describes a linear 
relationship with time. The second principal component captures a more quadratic fit 
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Table 1. 19 probes found to be significantly differentially ex- 
pressed according to gender by Karlovich et al. (2009), with a 
mean log-transformed signal intensity greater than or equal to 7 
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to the modelling of microarray time series data was Bar-Joseph et al. (2003) 
which dealt with the issue of clustering unrcplicatcd data. In their model, the 
curves were parameterised using B-splines and functional mixed-effects models 
were used to estimate the cluster mean curves and model the within-cluster 
variability. In their approach, the function /i(-) in (2) represents a given cluster's 
mean, and the functions fi{-),i = l,---,n represent the temporal profiles of 
each of the genes belonging to this cluster, of which there are n. A specialised 
EM algorithm was used to handle dynamic cluster assignments. A very similar 
approach was developed independently by Luan and Li (2003). 

A problem with Bar- Joseph et al. (2003); Luan and Li (2003) is that the 
B-splinc parameterisation of the curves requires selecting both the mimber and 
location of the knots - breakpoints for the piecewise polynomials - which control 
the overall smoothness of the fitted curve /(•). As the total number of knots 
is limited by the mimber of time points, there is limited scope for controlling 
the smoothness of the fit. Furthermore, each curve was parameterised using the 
same number of knots which may be unable to fully capture the wide range of 
temporal profiles we are likely to observe. Ma et al. (2006) set out to resolve these 
issues with their alternative framework for clustering. In their model, the cluster 
mean curves - //(•) in (2) - are represented using smoothing splines, which place a 
knot at each design time point and use a roughness penalty to avoid fitted curves 
which are too 'wiggly'. One drawback to their approach, however, is that the 
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Table 2. 21 probes found to have a significant gender-effect Aside from 
XIST, all of Vnese probes can be found on the Y-chromosome. Q-value 
indicates the corresponding false discovery rate (FDR) if a particular gene 
is taken to be the cut-off between significant and non-significant. 



Gene Name 


Chromosome 


AflFymctrix ID 


L2 norm 


q- value 


XIST 


X 


224588_at 


57.2 


0.00248 


XIST 


X 


224590_at 


53.9 


0.00248 


EIFIAY 


Y 


204409_s_at 


48.8 


0.00248 


RPS4Y1 


Y 


201909_at 


42.9 


0.00248 


DDX3Y 


Y 


205000_at 


36.7 


0.00248 


XIST 


X 


214218_s_at 


35.4 


0.00248 


EIFIAY 


Y 


204410_at 


34.5 


0.00248 


XIST 


X 


221728j<:_at 


33.2 


0.00248 


CYorflSB 


Y 


214131_at 


30.3 


0.00248 


CYorfl5A 


Y 


232618_at 


29.2 


0.00248 


USP9Y 


Y 


228492_at 


27.8 


0.00248 


JARIDID 


Y 


206700_s_at 


25.3 


0.00248 


XIST 


X 


224589_at 


24.8 


0.00248 




Y 


244482_at 


22.4 


0.00430 


XIST 


X 


227671_at 


22.2 


0.00430 


TSIX 


X 


231592_at 


18.4 


0.0247 


BCORL2 


Y 


1562313_at 


18.4 


0.0247 




Y 


1560800_at 


16.3 


0.0323 


DDX3Y 


Y 


205001_s_at 


16.1 


0.0543 


CYorflSB 


Y 


223646_s_at 


14.1 


0.0597 


CYorfl5A 


Y 


236694_at 


13.8 


0.0845 
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individual functions fi{-),i = 1, • • • ,n arc only modelled as scalar shifts rather 
than smooth curves. This leads to a more parsimonious model which avoids 
fitting too many parameters but may fail to adequately model the within-cluster 
variability. 

Angelini et al. (2009) adopt a fully Bayesian approach to estimation and 
testing in unreplicated or cross-sectional microarray data sets. Each gene is 
represented using Legendre polynomials. Three choices for a prior on the noise 
variance allows for errors which are marginally normal, Student t or double 
exponentially distributed, although tr^ is assumed the same for all genes. This 
assumption is unlikely to hold in practice, as a correlation between gene expres- 
sion intensity and measurement noise is well known (Tusher et al., 2001). Given 
the fully Bayesian framework, hypothesis testing for differences in expression 
levels across two biological groups is performed using Bayes Factors. 

A handful of models and computer packages have also specifically been sug- 
gested to model longitudinal data. For instance, Timecourse is an R package 
based on Tai and Speed (2006). where multivariate analysis techniques arc ap- 
plied directly to the vectors of observations. This treatment of time as an un- 
ordered categorical variable - found also in ANOVA approaches as in Wang and 
Kim (2003) - has some significant drawbacks. In particular, the method cannot 
handle missing data, the results obtained by an analysis would be invariant to 
permutation of the time points, and it is assumed that the time points are reg- 
ularly spaced. Furthermore, this method only ranks the genes with no guidance 
given as to how to evaluate significance. 

The EDGE method of Storey et al. (2005) is a FDA approach to mod- 
elling both longitudinal and cross-sectional microarray data. In their method 
for longitudinal data analysis, each gene is modelled independently as a sepa- 
rate functional mixed-effects model. The mean curve - ^(•) in (2) - is modelled 
as a B-splinc while the individual effects are treated as scalar shifts as in Ma 
et al. (2006). A complete framework for detecting genes differentially expressed 
across two or more biological groupings is presented, with the model estimation 
performed by an EM algorithm. Differential expression is quantified using an 
F-type statistic which compares the residuals of a null model where the biologi- 
cal groupings are ignored to an alternative model where the groupings are taken 
into account. Significance is assessed by using a resampling bootstrap procedure 
to estimate the null distribution of this F-type statistic, and the multiple testing 
problem is handled by analysing the empirical p-value histogram (Storey and 
Tibshirani, 2003) to estimate the positive false discovery rate. 

Another way of accounting for the within-gene variance is to perform a func- 
tional principal components analysis (fPCA). This is analagous to the standard 
PCA, except the principal components (PCs) are functions rather than finite 
dimensional vectors. There have been a number of different methods suggested 
for estimating the PCs in a functional context including direct estimation in a 
mixed-effects model framework (James et al., 2000), standard PCA on discre- 
tised curves (Ramsay and Silverman, 2005) and 'Principal Components Analysis 
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through Conditional Expectation' (PACE) (Yao ct al., 2005). It is this latter 
approach which Liu and Yang (2009) applied to the analysis of microarray data; 
however, PACE was originally proposed for data where the observations on each 
individual are taken at different time points - for example, in the case of growth 
curve data - in our experience, microarray experiments tend to have much more 
regular designs, with each individual observed at the same time points, although 
these may, indeed, be unequally spaced. 

Some key shortcomings of these methods should be noted. Firstly, none 
of the methods can incorporate the gender and age covariates, particularly as 
functions of time. Secondly, all of these approachs either use B-splines and/or 
model the individual 'functions' as scalar-shifts, both of which lead to inflexible 
models. Finally, we are not aware of any existing methods which address the 
issue of modelling both the within- and between-gene variation. Our proposed 
methodology addresses some of these limitations. 

Our results related to the case study presented in section 2 can be compared 
to the original findings of Karlovich et al. (2009), who used a non- functional 
mixed-effect model. Karlovich et al. (2009) lists 19 probes they detected as 
having a significant gender effect and with a log-transformed signal intensity 
greater than 7, which we have reproduced here in Table 1. No justification for 
this cut-off of 7 is provided, and this filter gives misleading results. For instance, 
all of the significant gender genes we have identified fail to meet the cut-off. 
This is because the mean log-transformed signal intensity is taken across both 
genders, and all of our genes aside from XIST are found on the Y-chromosome 
and hence completely unexpressed in females. 

We were unable to find any confirmation in the literature that TUBB2A is 
a sex-related gene, and it does not appear in the 15 probes given by Karlovich 
et al. (2009). With a mean log-transformed signal intesity of 7.4, it meets their 
cut-off criteria. It is possible that they removed the probe from their analysis 
if the residuals from their model were found to be non-normally distributed. 
Indeed, the unadjusted p-value for the Shapiro- Wilk test on the residuals of our 
model for this probe is 2.54e — 5. However, looking at the residual analysis 
plotted in Figure 2, it is easy to see that there is one very large outlier. This 
observation corresponds to subject 174 at Day 90. Subject 174 is the individual 
who developed lung cancer between days 28 and 90, and died prior to day 180. If 
this observation is removed then the unadjusted Shapiro- Wilk p-value is 0.297, 
and the null hypothesis that the residuals are normally distributed is no longer 
rejected. Hence, TUBB2A may indeed be a novel gender regulated gene. 

The number of temporally regulated genes we identified are consistent with 
Karlovich et al. (2009), although their method for identifying differentially ex- 
pressed genes is quite dissimilar to ours (see Section 2). Indeed, although they 
found 66 significant genes associated with apoptosis, we found only 15, suggest- 
ing the actual significant genes found may vary more widely than the numbers 
suggest. 
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6. Conclusions 



In this paper we have demonstrated a complete framework for the analysis of 

microarray time series data. The unique characteristics of microarry data lend 
themselves well to a functional data analysis approach and we have shown how 
this naturally extends to the inclusion of covariates such as age and sex. Our 
model presented here is a specialisation of the more general functional mixed- 
effects model (Rice and Wu, 2001; Guo, 2002) and, to the best of our knowledge, 
we are the first to show how to derive the maximum-likelihood estimators, EM- 
algorithm, confidence intervals and smoother matrix with more than one fixed- 
effects function. 

We were motivated by a real data set and we have aimed to improve upon 
the existing results with a more flexible model. By taking a roughness penalty 
approach, this is achieved while avoiding overfitting, allowing for a departure 
from the original linear mixed-effects model when the data permits it. A deeper 
biological interpretation is required to fully assess our success here, but the re- 
sults we have highlighted in this paper suggest that we can easily attach meaning 
to our findings. It may also prove worthwhile performing a comparative analysis 
with Eady et al. (2005), which is another, similar longitudinal study taken over 
a shorter period of five weeks. 



A. Appendix 

A. 1. Example incidence matrix 

In our example data set, there are 5 design time points: Day 1, 14, 28, 90 and 
180. Therefore, the incidence matrices for all individuals, Xi,i = 1, • • • ,n, all 
have 5 columns. The first column corresponds to observations at Day 1, the 
second to observations at Day 14 and so on. The rows correspond to the specific 
observations on a particular individual. If the individual is observed once at 
each design time point, then, assuming their vector of observations has been 
ordered according to the time points, Xi = I. 

Now consider the case of subject 174 who died prior to Day 180 and hence 
only contributed 4 observations at each of the remaining design time points. The 
design matrix for this individual has 4 rows, corresponding to the 4 observations, 
but still has 5 columns, corresponding to the design time points. Specifically the 
incidence matrix in this case is: 

"1 " 
10 

' ~ 10 
1 

Note how there is no 1 in the final column which would correspond to an obser- 
vation at Day 180. 
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A.2. Specification of rougtiness matrix G 

Green and Silverman (1994) show that there is a straight forward way to calculate 
the roughness matrix for a smoothing spline given the set of distinct time points 
Ti, • ■ ■ Tm- The roughness matrix is given as G = AB~^A^ where the matrices 
A and B are defined as follows. First calculate fir = r^+i — r^, r = 1, • • • , — 1, 
the differences between successive time points. Then matrix A is an M x (A/ — 2) 
matrix whose entries ar,s are given by 

0'r,r = hr 0,r+l,r = ~{fl'r ^ + ^r+l)' ^T-+2,r = '''r+l 

for r = 1, • • • , M - 2 and elsewhere. B is an (M - 2) x (M - 2) matrix with 
the entries given by 

I, hM-2 I. hM-2 + hM-l 

0M-3,M-2 — — e — , 0M-2,M-2 — 3 



A . 3. Derivation of conditional expectations 

We begin by first considering the posterior expectation of 7i7?' which, using 
basic properties of expectations, can be rewritten as: 



E 



1 



i=l 



n 



1 



E[iaJ\y,ri = 'fi\ 



i=l 



The definition of covariance allows us to write: 

E [iaj\y, ?7 = 17] = E [-fi\y, r} = r)\E [7^ |y, r) = r)] 
+ Cov{nfi\y,r] = rj,f'[\y,r] = ri) 

The problem is now to determine the mean and covariance of 7i|j/, for which we 
use a standard result conerning the multivariate normal distribution (See, for 
example, Anderson, 1958) which says, for any vectors Xi and X2 distributed as 
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Ml 






V12 ■ 
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. *2 






. V21 


V22 . 





the conditional distribution of Xi\x2 is given by 

Xl\X2 ~ Mini + Fi2F22^(X2 - /i2), Vu - ^12^22^^21] 

If we let Xi = J and X2 = y, and derive the covariance of 7 and y as Cov{'j, y) = 
D^X^ then we have 



7 

y\ri = r) 





Xf, 



„J2 
XD 



V 



7|y, r/ = 57 ~ J\f[D^X'^V-\y - X17), - D-yX^V-^XD^ 
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Recognising that, because and V are block diagonal and XD-yV ^{y — 
Xrj) = 7, we have 

7i|y, ^ = ^ ~ J^ili, - D^XjV-^XiD^] 
and we can now write 

E[iaJ\y,v = 'n] = ^aJ + [d^ - D^xJv-^XiD^] 

For the posterior expectation of ct^, we follow exactly the same approach, writing 



e 









R 


R 




y\ri = fi 






R 


V 


) 



e|y, ,7 = r) ~ M[RV-^{y - Xf)), R - RV-^R] 
ei\y,v = 'h^N[RiVr\yi - Xifi),Ri - RiVr^Ri] 

Note that 

RiVr\yi-Xifi) = {Vi - XiDXl)Vr\yi - Xifi) 
= {I - X,D^XlVr^){y, - X.'n) 
= {yt - XiTj) - XiD^XjVr\yi - Xif,) 

= yi- Xifi - Xi-ji 

= ii 



Ri-RiVr^Ri = a^In,-a^V-' 
= a\lr., - a'Vr^) 

and using the identity 

E[ejei\y,ri^fi\ = tr{E[e,eJ\y,ri ^ fi]} 
allows us to derive 
E[ejei\y,'n = ii\ = tr{E[e^eJ\y,r] = rj]} 

= tr{E[ei\y,ri = fi]E[eI\y,V = v]+(r\ln, - cr^Vr^)} 
= tr{e,ef+a'{I^,-a'V-')} 
= tr{e,ef} + tr{a\ln^ - a^Vf^)} 
= efe, + <j^tr{In,}-ahr{V-^}) 
= efei + a^n^-aHr{V-^}) 

and so 

n 

E[e^e\y,v = v] = Y.i^fe, + a'{ni-aHr{Vr^})] 
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